(EAI-375): Ingest snooty docs facets and meta #558
Closed
Jira: https://jira.mongodb.org/browse/EAI-375
Changes
Ingest the snooty meta and facets fields as Page.metadata
Notes
facets.toml files. There's a DOP ticket to capture the facets.toml: https://jira.mongodb.org/browse/DOP-5182
Experiment Results
Initial experiment
The experiment compares the ingestion pipeline with the new snooty metadata against the previous baseline. The results can be found here: mongodb-chatbot-retrieval/experiments/mongodb-chatbot-retrieval-snooty-metadata
The results show a very slight decrease in search quality as a result of these changes.
Follow up experiment
After upgrading the embedding model and preprocessor, the results are even worse. They can be seen in Braintrust.
BinaryNDCG@5 drops from 51.77% to 46.93% (a decrease of 4.84 percentage points).
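For readers unfamiliar with the metric, here is a minimal sketch of how a binary NDCG@k score is commonly computed. It assumes the standard definition (relevance is 0 or 1, with a log2 rank discount); the exact scorer used in the Braintrust experiment is not shown in this PR and may differ. The function name and arguments are hypothetical.

```python
import math
from typing import Sequence, Set


def binary_ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Binary NDCG@k under the standard definition (an assumption, not the PR's scorer).

    retrieved: ranked result identifiers, best first.
    relevant: identifiers considered relevant for the query.
    """
    seen = set()
    dcg = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        # Count each relevant result once so duplicates cannot inflate the score.
        if doc_id in relevant and doc_id not in seen:
            seen.add(doc_id)
            dcg += 1.0 / math.log2(rank + 1)

    # Ideal ranking: every relevant result (up to k) sits at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Under this definition, a relevant result that slips from rank 1 to rank 2 lowers the score for a single-answer query from 1.0 to roughly 0.63, so the metric is sensitive to ordering within the top 5 as well as to recall.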
Next Steps
Based on the results, I think we should not ingest and chunk this metadata. Instead, we should ingest it so it's present in the pages collection, but doesn't get included in the embedded_content. I'll create a follow-up PR for that.
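A rough sketch of the proposed split, in Python for illustration only. The Page and EmbeddedChunk shapes, the chunk_page helper, and the EXCLUDED_FROM_EMBEDDING set are all hypothetical and are not the project's actual types or API. The idea is that the page record keeps meta and facets, while chunks destined for embedded_content omit them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Page:
    """Hypothetical stand-in for a record in the pages collection."""
    url: str
    body: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EmbeddedChunk:
    """Hypothetical stand-in for a record in embedded_content."""
    url: str
    text: str
    metadata: Dict[str, Any]


# Assumption: these are the metadata keys to keep out of the chunks.
EXCLUDED_FROM_EMBEDDING = frozenset({"meta", "facets"})


def chunk_page(page: Page, chunk_size: int = 200) -> List[EmbeddedChunk]:
    """Split the page body into fixed-size chunks, dropping excluded metadata."""
    kept = {k: v for k, v in page.metadata.items() if k not in EXCLUDED_FROM_EMBEDDING}
    pieces = [page.body[i:i + chunk_size] for i in range(0, len(page.body), chunk_size)]
    # Each chunk gets its own copy of the filtered metadata; the page is not modified.
    return [EmbeddedChunk(url=page.url, text=piece, metadata=dict(kept)) for piece in pieces]
```

The fixed-size split is a placeholder for whatever chunking strategy the pipeline uses; the only point illustrated is that the filter applies at chunking time, so the stored page still carries the full metadata.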